ABSTRACT

We describe a supervised learning algorithm for image analysis that can automatically
classify galaxy images. The algorithm is first trained using manually classified
images of elliptical, spiral, and edge-on galaxies. A large set of image features is ex- 
tracted from each image, and the most informative features are selected using Fisher 
scores. Test images can then be classified using a simple Weighted Nearest Neighbor 
rule in which the Fisher scores are used as the feature weights. Experimental results
show that galaxy images from Galaxy Zoo can be classified automatically into spiral,
elliptical and edge-on galaxies with an accuracy of ~90% compared to classifications
carried out by the author. Full compilable source code of the algorithm is available for
free download, and its general-purpose nature makes it suitable for other uses that 
involve automatic image analysis of celestial objects. 

Key words: Methods: data analysis - Techniques: image processing. 
1 INTRODUCTION

In the past several years autonomous sky surveys have
become increasingly important, and these ventures have
generated large datasets of astronomical images and made
them available. The availability of these large
datasets has introduced the need for tools that can automat- 
ically analyze astronomical images. This includes the need 
for automatic morphological classification of celestial objects 
that appear inside an astronomical frame. 

One approach to classification of large sets of galaxy
images, which was successfully adopted by the Galaxy Zoo
project (Lintott et al. 2008), allows hobbyist volunteers to
log in and manually classify galaxies via the project web
site. The galaxy images are acquired by the Sloan Digital
Sky Survey (SDSS), and displayed by Galaxy Zoo as JPEG
images scaled by 0.024R_p, where R_p is the Petrosian radius
(Petrosian 1976) for the galaxy.

While each volunteer can classify just a limited number 
of galaxies, the efficacy of the data analysis is enabled by 
the availability of a very large number of human observers. 
However, the bottleneck introduced by the manual analysis 
limits the ability of this method to provide quick analysis of 
massive galaxy datasets. 

Here we describe a software tool that can be used for 
automatic classification of galaxy images. The algorithm was 
originally developed for automatic analysis of cell morphol- 
ogy, but its general-purpose design allows it to be effective 
for applications outside the scope of cell biology. Full compilable
source code can be freely downloaded. In Section 2
we briefly describe the algorithm, and in Section 3 the experimental
results are discussed.



2 IMAGE ANALYSIS METHOD 

The image analysis algorithm used for the automatic galaxy
image classification is WND-CHARM (Shamir et al. 2008a;
Orlov et al. 2008), which was originally designed for automatic
analysis of cell and tissue images, but has also demonstrated
efficacy as a general-purpose image analysis tool
(Shamir 2008; Shamir et al. 2009a). WND-CHARM first reduces
each image to a total of 2873 numerical low-level
descriptors (when the "-l" switch in the command line is
turned on, which indicates that the larger set of image features
should be computed). These generic image features
include high-contrast features (object statistics, edge statistics,
Gabor filters), textures (Haralick, Tamura), statistical
distribution of the pixel values (multi-scale histogram,
first four moments), factors from polynomial decomposition
of the image (Chebyshev statistics, Chebyshev-Fourier
statistics, Zernike polynomials), Radon features and fractal
features. A detailed description of these image content
descriptors is available in Orlov et al. (2008), Shamir (2008)
and Shamir et al. (2008a, 2009a,b). To extend the number and
variety of the image features, these algorithms are applied
not only to the raw image pixels, but also to several transforms
of the image such as the Fourier, Chebyshev, Wavelet,
and edge-magnitude transforms, as well as tandem transform
combinations (Shamir et al. 2008a; Shamir 2008).

Table 1. Confusion matrix of the classification of elliptical and spiral galaxies

              Elliptical   Spiral
Elliptical          1445       55
Spiral               150     1350

After image content descriptors for all images in the
training dataset are computed, each of the 2873 features is
assigned a Fisher discriminant score (Bishop 2006), and 85%
of the features with the lowest Fisher scores are rejected in
order to filter out non-informative image features. The distance
between two image feature vectors X and Y can then be computed
using a simple Weighted Nearest Neighbor rule, as
described by Equation 1:



$d = \sum_{f=1}^{|X|} W_f (X_f - Y_f)^2$,    (1)



where W_f is the assigned Fisher score of feature f,
and d is the computed weighted distance between the
two feature vectors. The predicted class of a given test
image is simply determined by the class of the training
image that has the shortest weighted distance d
to the test image. Compilable open-source code of the
WND-CHARM algorithm is available for free download at
http://www.cs.mtu.edu/~lshamir/downloads/ImageClassifier 
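
The feature weighting and nearest-neighbor rule described in this section can be illustrated with a minimal Python sketch. It is not the WND-CHARM source code: the function names, the toy two-feature data and the exact form of the Fisher score (variance of the class means divided by the mean within-class variance) are assumptions made for illustration, and the normalisation used by WND-CHARM may differ.

```python
from statistics import mean, pvariance


def fisher_scores(samples, labels):
    """Return one Fisher-style score per feature.

    Assumed form: variance of the class means divided by the mean
    within-class variance (the exact WND-CHARM normalisation may differ).
    """
    classes = sorted(set(labels))
    scores = []
    for f in range(len(samples[0])):
        per_class = [[s[f] for s, lab in zip(samples, labels) if lab == c]
                     for c in classes]
        between = pvariance([mean(values) for values in per_class])
        within = mean([pvariance(values) for values in per_class])
        scores.append(between / within if within > 0 else 0.0)
    return scores


def select_weights(scores, keep_fraction=0.15):
    """Keep the scores of the top `keep_fraction` of features; zero the rest."""
    n_keep = max(1, round(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda f: scores[f], reverse=True)
    kept = set(ranked[:n_keep])
    return [scores[f] if f in kept else 0.0 for f in range(len(scores))]


def weighted_distance(x, y, weights):
    """Weighted squared distance: sum over features of W_f * (X_f - Y_f)^2."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))


def classify(test_vector, train_samples, train_labels, weights):
    """Return the label of the training sample with the smallest weighted distance."""
    distances = [weighted_distance(test_vector, s, weights) for s in train_samples]
    return train_labels[distances.index(min(distances))]


# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
train = [[1.0, 5.0], [1.2, 1.0], [3.0, 4.8], [3.2, 1.2]]
labels = ["elliptical", "elliptical", "spiral", "spiral"]
weights = select_weights(fisher_scores(train, labels), keep_fraction=0.5)
print(classify([2.9, 1.1], train, labels, weights))  # spiral
```

In this toy example the noisy feature receives a zero weight, so the decision rests entirely on the informative feature.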



3 EXPERIMENTAL RESULTS 

The method was tested using a dataset of spiral and el- 
liptical galaxy images taken from Galaxy Zoo and classi- 
fied manually by the author. The galaxies in the dataset 
were selected randomly by the Galaxy Zoo web interface, 
and no attempt to normalize for luminosity, size or distance 
was made. Unclear cases were classified by the judgment 
of the author. This study, however, ignored the Galaxy Zoo
monochrome images that were introduced by the Galaxy
Zoo bias study (Lintott et al. 2008). Although only colour
images were used, no colour features were used in this study.

The 120x120 pixel block at the centre of each galaxy 
image was separated from the image and converted into 
lossless TIFF image format, from which image content 
descriptors were computed. The dataset includes images of 
247 spiral galaxies (one is repeated), 215 elliptical galax- 
ies, and 107 edge-on galaxies, and can be downloaded at 
http://www.cs.mtu.edu/~lshamir/downloads/galaxies.tar.gz
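
The location of the central 120x120 pixel block can be computed as in the following Python sketch. The function name and the 400x300 example image size are hypothetical; the sketch only computes the crop coordinates and leaves reading and writing the image to whichever imaging library is used.

```python
def center_crop_box(width, height, size=120):
    """Return (left, upper, right, lower) of a size x size block centred in the image."""
    if size > width or size > height:
        raise ValueError("crop size exceeds image dimensions")
    left = (width - size) // 2
    upper = (height - size) // 2
    return (left, upper, left + size, upper + size)


# Hypothetical 400x300 input image.
print(center_crop_box(400, 300))  # (140, 90, 260, 210)
```

A box in this (left, upper, right, lower) form is what, for example, Pillow's `Image.crop` accepts, after which the cropped block can be saved in a lossless format such as TIFF.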

In the first experiment, the image classification method
was used to classify between spiral and elliptical galaxies.
One hundred and fifty images from each class (spiral and
elliptical) were used for training, and 50 images from each
class for testing (by specifying the "-i150" and "-j50" parameters in the
wndchrm command line). The experiment was repeated 30
times such that in each run different images were selected
randomly from the pool of images and were allocated to
the training and test sets. The results show that ~93% of
the galaxy images were correctly classified as elliptical or
spiral galaxies, as can be seen in the confusion matrix
of Table 1.
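
The reported accuracy follows directly from the confusion matrix, as the short Python sketch below shows. It assumes that each row of Table 1 holds the test images of one class, which is consistent with the row totals of 1500 (30 runs of 50 test images per class); the function names are illustrative.

```python
def accuracy(confusion):
    """Overall accuracy: diagonal (correct) counts divided by all counts."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_accuracy(confusion):
    """Fraction of each row that falls on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(confusion)]


# Values from Table 1 (class order: elliptical, spiral).
table1 = [[1445, 55],
          [150, 1350]]

print(round(accuracy(table1), 4))                         # 0.9317
print([round(a, 3) for a in per_class_accuracy(table1)])  # [0.963, 0.9]
```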

While WND-CHARM computes a large set of image features,
not all features are expected to be equally informative,
and some are expected to represent noise. The estimated
informativeness of the different image content descriptors is
described by Figure 1, which shows the sum of the Fisher
scores of all bins of the different feature groups extracted
from the different image transforms (Shamir et al. 2008a).

While some of the informative image features are
highly non-intuitive, such as the Haralick texture features
(Haralick, Shanmugam & Dinstein 1973) computed from
the Chebyshev image transform, other image content descriptors
are easier to conceptualize. For instance, the fractal
features used by WND-CHARM (Wu, Chen & Hsieh 1992)
can become informative by sensing the fractal characteristics
of the shape of a spiral galaxy, which are not expected to exist
in an elliptical galaxy. The fractality of spiral galaxies can
often be sensed easily by the unaided eye. One example is the
picture of the M101 "pinwheel" galaxy (Nemiroff & Bonnell
2009), in which some of the arms split into secondary arms,
which then split again into smaller arms. When using the fractal
features alone, the classification accuracy between the
spiral and elliptical galaxies is ~76%, which demonstrates
the informativeness of the fractal features for galaxy morphology.

Clearly, the M101 picture taken by the Hubble Space Telescope
is much more detailed than the galaxy images acquired
by SDSS. However, while the fractality signal is obviously
weaker in the Sloan images, it still exists. Fractal analysis
methods can very often detect fractality that is very difficult
to sense by the unaided eye, and are sensitive to even subtle
fractal patterns (Mandelbrot 198?). Therefore, the informativeness
of the fractal features for detecting spiral galaxies
in the small-scale SDSS images cannot be considered surprising.

Other informative features include the Zernike polynomials
(Teague 1980), which are also expected to be informative
due to their radial nature that allows them to reflect
variations in the unit disk. Since the unit disk is
a fundamental and obvious difference between spiral and
elliptical galaxies, Zernike polynomials are expected to reflect
differences between these types of galaxies. Zernike polynomial
features can also be used for classification between true
elliptical galaxies and S0 galaxies that have a disk, which are
a major source of confusion in morphological classification
of galaxies. When only the Zernike features are used for the
classification, the accuracy is ~71%.

In addition to the predicted class of the galaxy (spiral
or elliptical), the classifier also returns the similarity of the
tested galaxy image to each of the classes, measured by the
distances between the feature vectors as described in Section 2
and normalized to the interval (0,1). For instance, for
a galaxy that is clearly spiral the similarity values to the
classes spiral and elliptical are expected to differ clearly
from each other, such as 0.65 and 0.35, respectively.
For a galaxy that does not have an obvious typical spiral
shape, the two values are expected to be closer to each other, such
as 0.52 and 0.48. These similarity values can provide additional
information about the morphology of the galaxies by
measuring how spiral or elliptical they are. Table 2
shows some sample galaxy images with their computed
similarity values and the automatic and manual classifications,
including some cases of disagreement between the author's
and the automatic classification. While the estimated
similarity for a single image is not always accurate, large
sets of images of each type can allow quantitative analysis
of the similarity between the different classes (Shamir et al.
2008a, 2009a).

Figure 2. Classification accuracy when excluding galaxy images
with similarity lower than a certain threshold

Table 3. Confusion matrix of the classification of edge-on and spiral galaxies

            Edge-on   Spiral
Edge-on         929       71
Spiral           85      915

The similarity value can also be used as an indication
of the certainty of a galaxy image classification, i.e., a classification
of a galaxy image with a high similarity value to
a certain morphological type can be considered more certain
than a classification in which the similarity value is
only slightly greater than 0.5. Figure 2 shows how the classification
accuracy responds to threshold similarity values. As the
figure shows, all galaxy classifications with similarity values
greater than 0.58 were correct, and the classification
accuracy is ~98.5% for galaxy image classifications
with similarity values greater than 0.54.

The number of galaxy images that correspond to the
threshold similarity values is shown in Figure 3. As the
figure shows, ~50% of all galaxy images can be classified
with accuracy greater than 99.5%, and ~80% of the galaxies
with accuracy greater than 97%. When the galaxy image
classifications that have similarity values only slightly
higher than 0.5 are also counted, the classification accuracy
drops below 94%.
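
The trade-off between the similarity threshold, the accuracy and the fraction of galaxies retained can be computed as in the following Python sketch. The function name and the list of (similarity, correct) pairs are made up for illustration and are not data from the experiments.

```python
def accuracy_above_threshold(results, threshold):
    """Return (accuracy, coverage) for classifications above a similarity threshold.

    `results` is a list of (similarity, is_correct) pairs, where similarity is
    the value assigned to the predicted class. Coverage is the fraction of all
    classifications that pass the threshold.
    """
    kept = [ok for similarity, ok in results if similarity > threshold]
    if not kept:
        return None, 0.0
    return sum(kept) / len(kept), len(kept) / len(results)


# Made-up results for illustration only.
results = [(0.51, False), (0.52, True), (0.53, False), (0.55, True),
           (0.57, True), (0.60, True), (0.62, True), (0.65, True)]

for threshold in (0.50, 0.54, 0.58):
    acc, coverage = accuracy_above_threshold(results, threshold)
    print(threshold, acc, coverage)
```

Raising the threshold discards the least certain classifications, so the accuracy on the retained galaxies rises while the coverage falls.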

An additional experiment tested whether the image
classifier described in Section 2 can classify between edge-on
and spiral galaxies. For this purpose, 80 images of edge-on
galaxies and a similar number of spiral galaxies were used for
training, and 20 images from each class for testing (by using
the "-i80" and "-j20" parameters of the wndchrm command
line). The experiment was repeated 50 times, such that in
each run images were allocated randomly to training and
test sets. The results show that 92% of the images were
correctly classified as spiral or edge-on galaxies, as can be
seen in the confusion matrix of Table 3.

Figure 3. Classification accuracy as a function of the number of
galaxies with the highest similarity values

Table 4. Confusion matrix of the classification of edge-on and elliptical galaxies

              Edge-on   Elliptical
Edge-on           976           24
Elliptical          8          992



A similar experiment tested whether the image classi- 
fier can classify between elliptical and edge-on galaxies. In 
this experiment, the dataset included 100 images of ellipti- 
cal galaxies and 100 images of edge-on galaxies, such that 
80 images from each set were used for training and 20 for 
testing. As before, the experiment was repeated 50 times, 
and the average classification accuracy was 98% as shown
by the confusion matrix of Table 4.

The accuracy of a three-way classifier for all three 
classes together (spiral, elliptical and edge-on galaxies) is 
90%. This was determined by 30 runs, such that in each run 
80 images of each of the three classes were randomly selected 
for training, and 20 images for testing. The confusion matrix
of the experiment is described by Table 5.

As discussed in Orlov et al. (2008) and Shamir et al. (2009a),
the accuracy of the image classifier is not very sensitive to
the number of image features due to the use of the feature
weights when computing the distances between feature
vectors. Rejecting the weakest 85% of the features when using
the larger feature set of the wndchrm tool (Shamir et al.
2008a) is often a reasonable starting point, and in many
cases other values (set by using the "-f" option in the command
line) do not improve the performance significantly.

Table 5. Confusion matrix of the classification of edge-on, elliptical, and spiral galaxies

              Edge-on   Elliptical   Spiral
Edge-on           551          ...       42
Elliptical        ...          561      ...
Spiral             38           47      515

Table 2. Automatic and author classification of sample Galaxy Zoo galaxy images

Galaxy Zoo ID         Similarity values      Automatic        Author's
                      (elliptical/spiral)    classification   classification
588023669702131872    0.547/0.453
587739380994998479    0.562/0.438
...                   0.520/0.480
587742783673991349    0.513/0.487
587742575925657806    0.466/0.534
...                   0.436/0.564
587736585508094159    0.416/0.584
588009366939238541    0.488/0.512
587735697522229435    0.508/0.492
587727221950447853    0.504/0.496
587741600950452406    0.510/0.490
588016878292762809    0.502/0.498
588015509808152736    0.427/0.573

Figure 4. Classification accuracy as a function of the size of the
feature set

Figure 4 shows how the classification accuracy changes as
more features are used.

As the figure shows, the classification accuracy increases 
as the number of used features gets larger, and starts to 
decrease when more than 15% of the features are used due 
to the increasing effect of noisy features. 

A major downside of the proposed algorithm is its computational
complexity. The extraction of a large number of
image features (431, i.e. 15% of 2873) from each image is a
computationally intensive task, so that computing the image
features for a single image takes ~35 seconds using a system
with a 2 GHz Intel processor and 2 GB of RAM. However,
the step of image feature extraction can be parallelized with
very low overhead (Shamir et al. 2008a), so that several
processors can work on the same dataset, reducing the
response time of the system almost linearly with the number
of processors. The classification of the feature values cannot
be parallelized without changing the software, but the
computational cost of this step is negligible.
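
As a rough worked example of the near-linear scaling, the Python sketch below estimates the feature extraction time for the dataset used in this paper. It assumes the ~35 seconds per image quoted above, an even split of the images between processors and zero parallelization overhead; the function name is illustrative.

```python
import math


def estimated_wall_time(n_images, n_processors, seconds_per_image=35):
    """Wall-clock seconds if images are split evenly and overhead is ignored."""
    images_per_processor = math.ceil(n_images / n_processors)
    return images_per_processor * seconds_per_image


# Spiral + elliptical + edge-on images in the dataset.
n_images = 247 + 215 + 107

for n_processors in (1, 4, 8):
    seconds = estimated_wall_time(n_images, n_processors)
    print(n_processors, seconds, round(seconds / 3600, 2))
```

Under these assumptions a single processor needs about 5.5 hours for the 569 images, while eight processors need about 42 minutes.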

The performance of the proposed image analysis
method was compared to galaxy classification using
the Gini coefficient (Abraham, van den Bergh & Nair
2003). This was done by using the morph command-line
utility, which is part of the Morpheus package
(Abraham, van den Bergh & Nair 2003). The same set of
galaxy images was used, but the images were converted into
FITS format, which is the native input format of morph.
Results show that in ~77% of the cases the Gini coefficient
accurately determined whether a galaxy is elliptical or spiral,
and Table 6 shows the confusion matrix of the classification.
The agreement between the method proposed in this paper
and the Gini coefficient method is ~75% for the elliptical
galaxies, and ~66% for the spiral galaxies. Clearly, there
is a better degree of agreement between the two methods
on elliptical galaxies compared to spiral galaxies. When using
the Gini coefficient method for classifying between the
three types of galaxies (elliptical, spiral and edge-on), the
classification accuracy is ~55%, as can be seen in the
confusion matrix of Table 7.
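
For readers unfamiliar with the Gini coefficient, the following Python sketch computes it from a list of pixel intensities using the commonly used sorted-values formula with an n(n-1) normalisation. This is an illustration only: whether the morph utility uses exactly this normalisation, and how it selects the pixels that belong to the galaxy, are not covered here.

```python
def gini(pixel_values):
    """Gini coefficient of pixel intensities (0 = uniform, 1 = all light in one pixel)."""
    x = sorted(abs(v) for v in pixel_values)
    n = len(x)
    if n < 2:
        raise ValueError("need at least two pixels")
    mean = sum(x) / n
    if mean == 0:
        return 0.0
    # Sum of (2i - n - 1) * x_i over the values sorted in ascending order.
    total = sum((2 * i - n - 1) * v for i, v in enumerate(x, start=1))
    return total / (mean * n * (n - 1))


print(gini([5, 5, 5, 5]))  # 0.0
print(gini([0, 0, 0, 8]))  # 1.0
```

A single number of this kind is very cheap to compute, but it summarises only how unevenly the light is distributed among the pixels.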

It should be noted that the computational cost of the
Gini coefficient, which is practically negligible, makes it dramatically
faster than computing the set of image features
used in this paper. It should also be noted that the Gini coefficient
performed better than any other single image content
descriptor included in the tested feature set (Shamir et al.
2008a).

Table 6. Confusion matrix of classification of elliptical and spiral galaxies using the Gini coefficient

              Elliptical   Spiral
Elliptical           157      ...
Spiral                51      149

Table 7. Confusion matrix of the classification of edge-on, elliptical, and spiral galaxies using the Gini coefficient

              Edge-on   Elliptical   Spiral
Edge-on            52           31       17
Elliptical         28           55       17
Spiral             22           20       58



4 CONCLUSIONS 

Here we described an algorithm that can automatically clas- 
sify between images of spiral, elliptical, and edge-on galax- 
ies. The galaxy dataset features a random collection of 
galaxy images. Since luminosity, size and distance were found to be
highly important for the automatic classification of galaxies
(Bamford et al. 2009), it can be assumed that the classification
accuracy can be improved when using datasets of
nearby, large or bright galaxies. Since the described super- 
vised machine learning method can be used for general pur- 
pose image classification, it is reasonable to assume that the 
same utility can be used for other problems in morphological 
analysis of celestial objects. 

The native format of Sloan images is FITS. Since the 
conversion from FITS format to lossy JPEG requires the 
sacrifice of image information, it can be assumed that di- 
rect access to the raw Sloan image files can potentially lead 
to a better performance, especially in cases of subtle dif- 
ferences of pixel intensity. Researchers are therefore advised 
to take this issue under consideration when applying WND- 
CHARM to problems in automatic galaxy morphology in 
which the differences between the galaxies are more difficult 
to notice by the unaided eye. 

The dataset used for the described experiments consists 
of galaxy images manually classified by the author. Since 
supervised learning is used, the classifier can be biased by 
the intuition of the person(s) who prepare the gold standard 
training data. Therefore, training data for an image classifier 
that can be used for practical galaxy morphology classifica- 
tion should be selected and reviewed carefully. Even if the 
selection of the data follows a different intuition than the 
author's, as long as the classification criteria are consistent 
for all images the supervised learning is expected to pro- 
vide performance figures that are comparable to the results 
reported in this paper. 

Interestingly, many of the challenges of automatic morphology
analysis of galaxies appear to be quite similar to those of
automatic analysis of cell morphology. For instance, the interest
in automatic detection of binucleate galaxies, which
indicate that two galaxies are merging, coincides
with the interest in automatic detection of binucleate cells,
which indicate that the cell failed to complete the process of
mitosis (e.g., G1 arrest). Another example is the interest
in peculiar galaxies, which coincides with the interest in affected
cells or unexpected phenotypes that are found among
very many regular cells.

One of the major advantages of the algorithm is that 
its full source code is available for free download as a compilable
software package (Shamir et al. 2008a) that has been
tested for robustness and correctness, and researchers who 
have basic computer skills can easily use the application as 
a command line utility. Therefore, in cases where there is 
a need for computer-based morphological analysis we en- 
courage scientists to try WND-CHARM before taking the 
labour-intensive challenge of designing, developing and test- 
ing new task-specific image classifiers. 

Applications of this method to galaxy classification in- 
clude fully automatic analysis of galaxies, but it can also 
be used as a decision-supporting tool for datasets that are 
classified manually such as Galaxy Zoo. 